Multivariate Segmentation of Time-Series Data
Abstract
Medical time-series data often contain sets of closely related, non-orthogonal channels – for example transcutaneous O2 and CO2, or mean, systolic and diastolic blood pressures. It is desirable, when summarising such sets of data, to select a single set of time-periods dividing the data into segments of distinctive character. This can be achieved using an extension of existing bottom-up segmentation techniques. A cost function over all related channels is defined which quantifies the amount of useful information lost if an adjacent pair of time-periods were to be consolidated.

Summarising multiple correlated time-series

In the medical domain it is often the case that a number of closely correlated time-series channels exist – for example transcutaneous partial pressures of O2 and CO2 (see figure 1), or systolic, diastolic and mean blood pressures. Such channels are non-orthogonal – that is, they can be considered to lie on axes that are not at right angles – and features present in one channel should to some extent be present in the others.

Figure 1 – Closely related time-series drawn from the NEONATE data set [1]. Here, TcPO2 (OX) and TcPCO2 (CO) have a strongly correlated inverse relationship.

Where there is a need to summarise the features present in such channels in the time domain, it may be useful to produce a single summary of the entire set rather than of each channel individually.

Summarising time-series data

A number of knowledge-based data summary mechanisms exist that are capable of considering multivariate data, for example knowledge-based temporal abstraction [2] and subsequent developments [3][4]. Such mechanisms depend on domain knowledge to generate sets of useful patterns and relationships to be searched for within a data set, but do not generate a summary of the entire data set.

Generic bottom-up segmentation

Segmentation of time-series data into periods of similar character is a useful and established mechanism for summarisation. In a generic ‘bottom-up’ segmentation algorithm [5][6], an initial set of segments representing the finest possible approximation of the data is constructed, the amount of useful information lost by merging each adjacent pair of segments is determined, and the ‘cheapest’ of these merges is carried out. Merging then continues until some stopping condition is met – for example, the amount of useful information discarded becomes significant:

    Construct an initial set of segments S
    Repeat
        Merge the pair of adjacent segments that involve the smallest loss of useful information according to some function
    Until a stopping condition is met

The ‘cost function’ used to determine the relative amount of useful information lost accepts a pair of adjacent time intervals from a single channel and returns a single quantitative measure of the information lost by merging them. The nature of the cost function depends on the features of the data that are considered useful. For example, if first-order trends are important, the sum of deviations of the original data points from a regression or interpolation line over them may be appropriate. Where a more intuitive segmentation is required, a cost function based on visible data features such as landmarks [7] may produce more appropriate results.
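To make the role of the cost function concrete, the sketch below (a minimal illustration, not code from the paper) implements the first-order-trend cost just described for a single channel: it fits a least-squares line over the union of two adjacent intervals and returns the summed deviation of the original samples from that line. The function name merge_cost, the slice-based interval representation and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def merge_cost(values: np.ndarray, left: slice, right: slice) -> float:
    """Sketch of a first-order-trend cost for one channel.

    Returns the information lost (sum of absolute deviations from a
    least-squares line) if the two adjacent intervals were merged.
    Assumes left.stop == right.start, i.e. the intervals are adjacent.
    """
    merged = values[left.start:right.stop]            # the consolidated interval
    x = np.arange(merged.size, dtype=float)
    slope, intercept = np.polyfit(x, merged, deg=1)   # regression line over the merge
    residuals = merged - (slope * x + intercept)
    return float(np.abs(residuals).sum())
```

In the bottom-up loop outlined above, such a function would be evaluated for every adjacent pair of segments and the cheapest merge carried out; a landmark-based cost [7] could be substituted without changing the loop itself.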
Segmenting related channels individually

Where a number of closely related channels are segmented individually, the desirable result would be identical or nearly identical sets of time-periods – that is, applying the same cost function to an initial set of segments constructed from each channel in a closely related set should, in general, result in the same pairs of segments in the time domain being selected for merging. In the absence of features that are genuinely not present in all related channels, differences in the segments selected for merging arise from error or noise in the data. However, it is in the nature of segmentation, and of bottom-up segmentation in particular, that these small differences are compounded. Taken cumulatively, they can result in notably different segmentations (see figure 2).

Figure 2 – Closely related time-series segmented separately within the Time-Series Workbench [8]. Segment boundaries differ significantly between the channels.

Producing a single summary of a set of channels segmented in this manner is then problematic. The intersection of the resulting sets of segments will contain many small, hard-to-summarise time periods, and the elimination of such segments is difficult to perform without reference to the original data.

Multivariate segmentation

The generic bottom-up segmentation method described above can be used to segment a set of channels rather than a single one by constructing a suitable cost function: a single function that reports the amount of information lost over all channels in the set by merging two adjacent time intervals.

Figure 3 – Closely related time-series segmented using a single, common cost function. Features common to both channels are reflected in the resulting segments.

Multivariate segmentation has a number of advantages. Primarily, small variations in the temporal location of segment endpoints which contain little useful information are suppressed (see figure 3). This in turn allows a more concise summary of the data to be made. The effects of noise in the response domain on the segmentation algorithm are also diminished, provided the noise is limited to a small proportion of the channels being considered.

A simple cost function for multivariate segmentation

The selection of a suitable cost function for multivariate segmentation raises a number of questions. Consider the simplest cost function: a summation over all channels of the error reported by a single-channel cost function. It is necessary to normalise the measures of information lost reported by the single-channel cost function to prevent the error from any one channel unduly influencing the segmentation process. Further, the normalisation must be carried out consistently over the entire time-period being segmented, as the segmentation algorithm compares values returned by the function for different time intervals. Without prior knowledge of the entire data set being segmented it is not possible to normalise the data or the measures of lost information properly – at best the mean and variance of each channel can be estimated from the portion of data available, or from domain knowledge such as the expected mean blood pressure of the patient to whom the data relates.
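As a rough sketch of the simple summed cost described above (an illustration rather than the authors' implementation), the function below standardises each channel using the mean and variance estimated from whatever data is available, applies the single-channel merge_cost sketched earlier to each channel, and sums the results. The optional per-channel weights anticipate the weighting discussed just below; the function names and the dictionary-of-arrays representation are assumptions of the example.

```python
def multivariate_merge_cost(channels, left, right, weights=None):
    """Sketch of a summed, per-channel-normalised multivariate merge cost.

    channels: mapping of channel name -> full, time-aligned NumPy series.
    weights:  optional per-channel weights for channels believed to be
              more representative (see the discussion that follows).
    Relies on the single-channel merge_cost sketched earlier.
    """
    total = 0.0
    for name, series in channels.items():
        # Normalise with mean/variance estimated from the available data only;
        # as noted above, these are estimates unless the whole data set is known.
        std = series.std()
        normalised = (series - series.mean()) / (std if std > 0 else 1.0)
        weight = 1.0 if weights is None else weights.get(name, 1.0)
        total += weight * merge_cost(normalised, left, right)
    return total
```

Because every channel is reduced to the same standardised scale before summation, no single noisy channel dominates the choice of merge, which is the behaviour argued for above.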
Where some channels are considered to be more representative of the data set, or to contain information more valuable to the summary than others, it may be useful to weight their contributions to the summation of values returned by the single-channel cost function. For example, if mean blood pressure is considered to be less prone to noise in the response domain than the related systolic and diastolic values, magnifying its influence on the segmentation process would give a corresponding decrease in sensitivity to noise from the less representative channels.

The effects of uncommon features

Multivariate segmentation by this method is essentially an averaging process. In averaging the amount of information lost by merging two adjacent time intervals across all channels in a related set, sensitivity to features not common to the entire set is necessarily reduced. Where valuable features are known to exist in only a subset of the channels being considered, it may be desirable to ‘boost’ the influence of outlying values returned by the single-channel cost functions by summing some function of them – for example, an exponentiation. Alternatively, it may be useful to carry out multivariate segmentation to generate a small number of larger intervals and then segment the resulting intervals one channel at a time to pick out any local features of interest. The channels within each interval that contribute the most to its overall measure of lost information are the most likely to benefit from this treatment.

Construction of single composite channels

Compared to segmenting time-series channels individually, the method of summarising related sets of time-series data described above benefits from making earlier use of the common features of the data. It should be possible to achieve a very similar effect by constructing a single composite time-series channel: values from all channels in a closely related set are projected onto the set's most common axis, and the single resulting time-series is then segmented using a normal single-channel segmentation process. The construction of such a composite channel suffers from the same normalisation problem described above: it is not possible to determine the most representative axis on which to project the channels without either examining the entire data set or introducing domain knowledge.
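One possible reading of this construction is sketched below: the channels are standardised and projected onto the first principal component of the stacked data, treating that component as the ‘most common axis’. The choice of principal-component projection, like the function name composite_channel, is an assumption made for illustration; the paper does not prescribe how the axis should be found, and the same caveat about estimating means and variances from incomplete data applies.

```python
import numpy as np

def composite_channel(channels):
    """Sketch: collapse a set of closely related channels into one series.

    Standardises each channel (using estimates from the available data)
    and projects the samples onto the first principal component, which
    stands in for the set's 'most common axis'.
    """
    # Stack time-aligned channels as columns: shape (n_samples, n_channels).
    data = np.column_stack([np.asarray(series, dtype=float)
                            for series in channels.values()])
    std = data.std(axis=0)
    std[std == 0] = 1.0                      # guard against flat channels
    data = (data - data.mean(axis=0)) / std
    # First principal direction via SVD of the standardised data.
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return data @ vt[0]
```

The resulting single series could then be passed to an ordinary single-channel bottom-up procedure such as the one sketched earlier.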
References

[1] Hunter J, Ewing G, Ferguson L, Freer Y, Logie R, McCue P, McIntosh N. The NEONATE database. In: Abu-Hanna A, Hunter J, eds. Working Notes of the Joint Workshop on Intelligent Data Analysis in Medicine and Pharmacology and Knowledge-Based Information Management in Anaesthesia and Intensive Care, held at AIME '03, 9th European Conference on Artificial Intelligence in Medicine, 2003. pp. 21-24.
[2] Shahar Y, Musen M. Knowledge-based temporal abstraction in clinical domains. Artificial Intelligence in Medicine 8, 1996. pp. 267-298.
[3] Shahar Y. A framework for knowledge-based temporal abstraction. Artificial Intelligence 90, 1997. pp. 79-133.
[4] Liu X, Swift S, Tucker A, Cheng G, Loizou C. Modelling multivariate time series. In: Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP-99), 1999.
[5] Keogh E, Chu S, Hart D, Pazzani M. An online algorithm for segmenting time series. In: Proceedings of the IEEE International Conference on Data Mining, 2001. pp. 289-296.
[6] Hunter J, McIntosh N. Knowledge-based event detection in complex time series data. Artificial Intelligence in Medicine. Springer, 1999. pp. 271-280.
[7] Perng CS, Wang H, Zhang SR, Parker DS. Landmarks: a new model for similarity-based pattern querying in time series databases. In: Proceedings of the 16th International Conference on Data Engineering (ICDE), 2000. pp. 33-42.
[8] The Time-Series Workbench. http://www.csd.abdn.ac.uk/~jhunter/research/TSW/
Similar resources
Missing data imputation in multivariable time series data
Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Multivariate time series missing data imputation is a challenging topic and needs to be carefully considered before learning or predicting time series. Much research has been done on the use of diffe...
An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering
Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of MTS analyzing process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...
Multivariate Segmentation of Time Series with Differential Evolution
A new method of time series segmentation is developed using differential evolution. Traditional methods of time series segmentation focus on single variable segmentation and as such often determine sections of the time series with constant slope (i.e. linear). The problem of segmenting multivariate time series is significantly more involved since several time series have to be jointly segmented...
Identification of outliers types in multivariate time series using genetic algorithm
Multivariate time series data are often modeled using the vector autoregressive moving average (VARMA) model. But the presence of outliers can violate the stationarity assumption and may lead to wrong modeling, biased estimation of parameters and inaccurate prediction. Thus, the detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...
Segmenting Big Data Time Series Stream Data
Big data time series data streams are ubiquitous in finance, meteorology and engineering. It may be impossible to process an entire “big data” continuous data stream or to scan through it multiple times due to its tremendous volume. In Heraclitus’s well-known saying, “you never step in the same stream twice,” and so it is with “big data” temporal data streams. Unlike traditional data sets, big ...
Greedy Gaussian Segmentation of Multivariate Time Series
We consider the problem of breaking a multivariate (vector) time series into segments over which the data is well explained as independent samples from a Gaussian distribution. We formulate this as a covariance-regularized maximum likelihood problem, which can be reduced to a combinatorial optimization problem of searching over the possible breakpoints, or segment boundaries. This problem is in...
Publication date: 2004